RANDOM PERTURBATIONS OF STOCHASTIC CHAINS WITH 
UNBOUNDED VARIABLE LENGTH MEMORY 



PIERRE COLLET, ANTONIO GALVES, AND FLORENCIA G. LEONARDI 

Abstract. We consider binary infinite order stochastic chains perturbed by a random noise. 
This means that at each time step, the value assumed by the chain can be randomly and 
independently flipped with a small fixed probability. We show that the transition probabilities
of the perturbed chain are uniformly close to the corresponding transition probabilities
of the original chain. As a consequence, in the case of stochastic chains with unbounded but 
otherwise finite variable length memory, we show that it is possible to recover the context 
tree of the original chain, using a suitable version of the algorithm Context, provided that 
the noise is small enough. 



1. Introduction 

The original motivation of this paper is the following question: is it possible to recover the
context tree of a variable length Markov chain from a noisy sample of the chain? We recall
that in a variable length Markov chain the conditional probability of the next symbol, given
the past, depends on a variable portion of the past whose length depends on the past itself.
This class of models was first introduced by Rissanen (1983), who called them finite memory
sources or tree machines. They recently became popular in the statistics literature under the
name of variable length Markov chains (VLMC) (Bühlmann and Wyner, 1999).

The notion of variable memory model can be naturally extended to a non-Markovian
situation where the contexts are still finite, but their lengths are no longer bounded (see for
example Ferrari and Wyner (2003), Csiszár and Talata (2006) and Duarte et al. (2006)).
This leads us to consider not only randomly perturbed unbounded variable length memory
models, but more generally randomly perturbed infinite order stochastic chains.

We will consider binary chains of infinite order in which at each time step the value assumed 
by the chain can be randomly and independently flipped with a small fixed probability. Even
if the original chain is Markovian, the perturbed chain is in general a chain of infinite order
(we refer the reader to Fernández et al. (2001) for a self-contained introduction to chains of
infinite order). We show that the transition probabilities of the perturbed chain are uniformly
close to the corresponding transition probabilities of the original chain. More precisely, we 
prove that the difference between the conditional probabilities of the next symbol given a 
finite past of any fixed length is uniformly bounded above by the probability of flipping, 
multiplied by a fixed constant. This is the content of our first theorem. 

Using this result we are able to solve our original problem of recovering the context tree 
of a chain with unbounded variable length from a noisy sample. To make this point clear, 
we must explain the notion of context. In his original 1983 paper, Rissanen used the word 
context to designate the minimal suffix of the string of past symbols which is enough to define 
the probability of the next symbol. Rissanen also observed that this notion is interesting only 
if the set of all contexts satisfies the suffix property, which means that no context is a proper 
suffix of another context. This property allows one to represent the set of all contexts as the set
of leaves of a rooted labeled tree, henceforth called the context tree of the chain. With this 
representation the process is described by the tree of all contexts and an associated family of 
probability measures on A, indexed by the leaves of the tree. Given a context, its associated 
probability measure gives the probability of the next symbol for any past having this context 
as a suffix. 

Rissanen (1983) not only introduced the class of variable memory models but also
introduced the algorithm Context to estimate both the context tree and the associated family
of transition probabilities. The way the algorithm Context works can be summarized as follows.
Given a sample produced by a chain with variable memory, we start with a maximal tree of 
candidate contexts for the sample. The branches of this first tree are then pruned until we 
obtain a minimal tree of contexts well adapted to the sample.

From Rissanen (1983) to Galves et al. (2006), passing by Ron et al. (1996) and Bühlmann
and Wyner (1999), several variants of the algorithm Context have been presented in the
literature. In all the variants the decision to prune a branch is taken by considering a 
gain function. A branch is pruned if the gain function assumes a value smaller than a 
given threshold. The estimated context tree is the smallest tree satisfying this condition.
The estimated family of transition probabilities is the one associated with the minimal tree of
contexts.

Rissanen (1983) proved the weak consistency of the algorithm Context when the tree of 
contexts is finite. Bühlmann and Wyner (1999) proved the weak consistency of the algorithm
also in the finite case without assuming a prior known bound on the maximal length of the 
memory but instead using a bound allowed to grow with the size of the sample. In both papers 
the gain function is defined using the log likelihood ratio test to compare two candidate trees 
and the main ingredient of the consistency proofs was the chi-square approximation to the 
log likelihood ratio test for Markov chains of fixed order. 

The unbounded case was considered by Ferrari and Wyner (2003), Duarte et al. (2006), 
Csiszár and Talata (2006) and Leonardi (2007). The first two papers essentially extend to
the unbounded case the original chi-square approach introduced by Rissanen. Instead of the
chi-square, the last two papers use penalized likelihood algorithms, related to the Bayesian
Information Criterion (BIC), to estimate the context tree. We refer the reader to Csiszár and
Talata (2006) for a nice description of other approaches and results in this field, including 
the context tree maximizing algorithm by Willems et al. (1995). 

In the present paper we use a variant of the algorithm Context introduced in Galves et al. 
(2006) for finite trees and extended to unbounded trees in Galves and Leonardi (2007). In this 
variant, the decision of pruning a branch is taken by considering the difference between the 
estimated conditional probabilities of the original branch and the pruned one, using a suitable 
threshold. Using exponential inequalities for the estimated transition probabilities associated 
to the candidate contexts, these papers not only show the consistency of this variant of the 
algorithm Context, but also provide an exponential upper bound for the rate of convergence. 

This version of the algorithm Context does not distinguish transition probabilities which 
are closer than the threshold level used in the pruning decision. Our first theorem ensures that
this is what happens between the conditional probabilities of the original variable memory
chain and the perturbed one, if the probability of random flipping is small enough. Hence it
is natural to expect that with this version of the algorithm Context, one should be able to
retrieve the original context tree from the noisy sample. This is actually the case, as we
prove in the second theorem. 

The paper is organized as follows. In Section 2 we give the definitions and state the main
results. Sections 3 and 4 are devoted to the proofs of Theorems 1 and 2, respectively.

2. Definitions and results 

Let $A$ denote the binary alphabet $\{0,1\}$, with size $|A| = 2$. Given two integers $m \le n$, we
will denote by $w_m^n$ the sequence $(w_m, \ldots, w_n)$ of symbols in $A$. The length of the sequence
$w_m^n$ is denoted by $\ell(w_m^n)$ and is defined by $\ell(w_m^n) = n - m + 1$. Any sequence $w_m^n$ with $m > n$
represents the empty string and is denoted by $\lambda$. The length of the empty string is $\ell(\lambda) = 0$.

Given two sequences $w$ and $v$, we will denote by $vw$ the sequence of length $\ell(v) + \ell(w)$
obtained by concatenating the two strings. In particular, $\lambda w = w\lambda = w$. The concatenation
of sequences is also extended to the case in which $v$ denotes a semi-infinite sequence, that is
$v = v_{-\infty}^{-1}$.

We say that the sequence $s$ is a suffix of the sequence $w$ if there exists a sequence $u$, with
$\ell(u) \ge 1$, such that $w = us$. In this case we write $s \prec w$. When $s \prec w$ or $s = w$ we write
$s \preceq w$. Given a sequence $w$ we denote by $\mathrm{suf}(w)$ the largest suffix of $w$.
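
As an informal illustration of this notation, the sketch below encodes sequences as Python strings whose rightmost character is the most recent symbol. The function names `is_proper_suffix` and `suf` are hypothetical helpers chosen for this example, not part of the original text.

```python
def is_proper_suffix(s: str, w: str) -> bool:
    """True if w = u + s for some u with len(u) >= 1 (s is a suffix of w, s != w)."""
    return len(s) < len(w) and w.endswith(s)


def suf(w: str) -> str:
    """Largest proper suffix of w: drop the leftmost (oldest) symbol."""
    return w[1:]


print(is_proper_suffix("01", "101"))  # suffix relation holds
print(suf("101"))                     # largest suffix of 101
```

Under this encoding the empty string plays the role of the empty sequence.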

In the sequel $A^j$ will denote the set of all sequences of length $j$ over $A$ and $A^*$ represents
the set of all finite sequences, that is
\[ A^* = \bigcup_{j=1}^{\infty} A^j. \]

Definition 2.1. A countable subset $T$ of $A^*$ is a tree if no sequence $s \in T$ is a suffix of
another sequence $w \in T$. This property is called the suffix property.
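
The suffix property can be checked directly on a finite set of strings. The following is a minimal sketch, assuming sequences are Python strings with the most recent symbol on the right; `has_suffix_property` is a hypothetical name used only here.

```python
from itertools import permutations


def has_suffix_property(tree: set) -> bool:
    """True if no element of the set is a proper suffix of another element."""
    return not any(
        len(s) < len(w) and w.endswith(s) for s, w in permutations(tree, 2)
    )


print(has_suffix_property({"1", "10", "00"}))  # a valid tree
print(has_suffix_property({"0", "10"}))        # "0" is a suffix of "10"
```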

We define the height of the tree $T$ as
\[ \ell(T) = \sup\{\ell(w) : w \in T\}. \]
In the case $\ell(T) < +\infty$ it follows that $T$ has finite cardinality. In this case we say that $T$
is bounded and we will denote by $|T|$ the number of sequences in $T$. On the other hand, if
$\ell(T) = +\infty$ then $T$ has a countable number of sequences. In this case we say that the tree
$T$ is unbounded.

Given a tree $T$ and an integer $K$ we will denote by $T|_K$ the tree $T$ truncated to level $K$,
that is
\[ T|_K = \{w \in T : \ell(w) \le K\} \cup \{w : \ell(w) = K \text{ and } w \prec u, \text{ for some } u \in T\}. \]
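
For a finite tree, the truncation can be computed as in the sketch below. It assumes $K \ge 1$ and the string encoding with the most recent symbol on the right; the function name `truncate` and the example tree are illustrative choices, not taken from the paper.

```python
def truncate(tree: set, K: int) -> set:
    """Truncate a finite tree to level K >= 1.

    Sequences of length <= K are kept; longer ones are replaced by
    their suffix of length K.
    """
    if K < 1:
        raise ValueError("K must be at least 1")
    kept = {w for w in tree if len(w) <= K}
    cut = {u[-K:] for u in tree if len(u) > K}
    return kept | cut


print(sorted(truncate({"1", "10", "100", "000"}, 2)))
```

In this example both "100" and "000" collapse to the single node "00".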

We will say that a tree is irreducible if no sequence can be replaced by a suffix without
violating the suffix property. This notion was introduced in Csiszár and Talata (2006) and
generalizes the concept of complete tree.

Definition 2.2. A probabilistic context tree over $A$ is an ordered pair $(T,p)$ such that

(1) $T$ is an irreducible tree;

(2) $p = \{p(\cdot|w) ; w \in T\}$ is a family of transition probabilities over $A$.

Consider a stationary stochastic chain $\{X_t : t \in \mathbb{Z}\}$ over $A$. Given a sequence $w \in A^j$ we
denote by
\[ p(w) = \mathbb{P}(X_1^j = w) \]
the stationary probability of the cylinder defined by the sequence $w$. If $p(w) > 0$ we write
\[ p(a|w) = \mathbb{P}(X_0 = a \mid X_{-j}^{-1} = w). \]

Definition 2.3. A sequence $w \in A^j$ is a context for the process $\{X_t : t \in \mathbb{Z}\}$ if $p(w) > 0$ and
for any semi-infinite sequence $x_{-\infty}^{-1}$ such that $w$ is a suffix of $x_{-\infty}^{-1}$ we have that
\[ \mathbb{P}(X_0 = a \mid X_{-\infty}^{-1} = x_{-\infty}^{-1}) = p(a|w), \quad \text{for all } a \in A, \tag{2.4} \]
and no suffix of $w$ satisfies this equation.

Definition 2.5. We say that the process $\{X_t : t \in \mathbb{Z}\}$ is compatible with the probabilistic
context tree $(T,p)$ if the following conditions are satisfied

(1) $w \in T$ if and only if $w$ is a context for the process $\{X_t : t \in \mathbb{Z}\}$.

(2) For any $w \in T$ and any $a \in A$, $p(a|w) = \mathbb{P}(X_0 = a \mid X_{-\ell(w)}^{-1} = w)$.

In the unbounded case, the compactness of $A^{\mathbb{Z}}$ ensures that there is at least one stationary
stochastic chain compatible with a probabilistic context tree. The uniqueness requires further
conditions, such as the ones presented in Fernández and Galves (2002).

Definition 2.6. A probabilistic context tree $(T,p)$ is of type B if it satisfies the following
conditions

(1) Non-nullness, that is
\[ \alpha := \inf_{w \in T,\, a \in A} p(a|w) > 0; \]

(2) Log-continuity, that is
\[ \beta_k \to 0 \quad \text{when } k \to \infty, \]
where the sequence $\{\beta_k\}_{k \in \mathbb{N}}$ is defined by
\[ \beta_k := \sup\Big\{ \Big| 1 - \frac{p(a|w)}{p(a|v)} \Big| : a \in A,\ v, w \in T \text{ with } w \stackrel{k}{=} v \Big\}. \]
Here, $w \stackrel{k}{=} v$ means that there exists a sequence $u$, with $\ell(u) = k$, such that $u \prec w$
and $u \prec v$. The sequence $\{\beta_k\}_{k \in \mathbb{N}}$ is called the continuity rate.

For a probabilistic context tree of type B with summable continuity rate, the maximal
coupling argument used in Fernández and Galves (2002) implies the uniqueness of the law of
the chain compatible with it. We will therefore assume here that the continuity rate is summable,
that is
\[ \beta := \sum_{k \in \mathbb{N}} \beta_k < +\infty. \tag{2.7} \]
This condition immediately implies that
\[ 1 \le \beta^* := \prod_{k=0}^{+\infty} (1 + \beta_k) < +\infty. \]
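
The implication above rests on the elementary inequality $\prod_k (1 + \beta_k) \le \exp(\sum_k \beta_k)$. The sketch below checks it numerically for a hypothetical geometric continuity rate $\beta_k = 0.5 \cdot 0.5^k$, truncated at 60 terms; the rate is an assumption made purely for illustration.

```python
import math


def continuity_constants(beta_k):
    """Return (beta, beta_star) for a finite list of continuity rates."""
    beta = sum(beta_k)                               # beta = sum of beta_k
    beta_star = math.prod(1.0 + b for b in beta_k)   # beta* = prod (1 + beta_k)
    return beta, beta_star


# Hypothetical geometric rate, truncated after 60 terms.
rates = [0.5 * 0.5 ** k for k in range(60)]
beta, beta_star = continuity_constants(rates)
print(beta, beta_star, math.exp(beta))
```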

Given an integer $k \ge 1$ we define
\[ D_k = \min_{w \in T :\, \ell(w) \le k}\ \max_{a \in A} \{ |p(a|w) - p(a|\mathrm{suf}(w))| \}. \tag{2.8} \]

In this paper we are interested in the effect of a Bernoulli noise that flips, independently of
the chain, the successive symbols it produces. Namely, let $\{\xi_t : t \in \mathbb{Z}\}$ be an i.i.d. sequence
of random variables taking values in the alphabet $A$, independent of $\{X_t : t \in \mathbb{Z}\}$, with
\[ \mathbb{P}(\xi_t = 0) = 1 - \epsilon, \]
where $\epsilon$ is a fixed noise parameter in $[0,1]$. For $a$ and $b$ in $A$, we define
\[ a \oplus b = a + b \pmod 2, \]
and $\bar{a} = 1 \oplus a$. We now define the stochastically perturbed chain $\{Z_t : t \in \mathbb{Z}\}$ by
\[ Z_t = X_t \oplus \xi_t. \]
The process $\{Z_t : t \in \mathbb{Z}\}$ is an example of a hidden Markov model. In the case $\epsilon = 1/2$,
$\{Z_t : t \in \mathbb{Z}\}$ is an i.i.d. uniform Bernoulli sequence. However, in general it is not a chain of finite order.
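
The perturbation mechanism is easy to simulate. The following minimal sketch applies independent bit flips to a given binary sequence; the function name `perturb` and the use of Python's `random.Random` generator are choices made for this example only.

```python
import random


def perturb(x, eps, rng):
    """Return z with z[t] = x[t] XOR xi[t], where xi[t] = 1 with probability eps."""
    return [xt ^ (1 if rng.random() < eps else 0) for xt in x]


rng = random.Random(0)
x = [0, 1, 1, 0, 1]
print(perturb(x, 0.0, rng))  # no noise: the sequence is unchanged
print(perturb(x, 1.0, rng))  # every symbol is flipped
```

With `eps = 0.5` each output symbol is a fair coin regardless of `x`, which matches the remark about the case $\epsilon = 1/2$.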
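We will use the shorthand notation
\[ q(w_1^j) = \mathbb{P}(Z_1^j = w_1^j) \]
and
\[ q(a|w_{-j}^{-1}) = \mathbb{P}(Z_0 = a \mid Z_{-j}^{-1} = w_{-j}^{-1}) \]
to denote the probabilities corresponding to the process $\{Z_t : t \in \mathbb{Z}\}$. We also define
\[ q_k = \min\{ q(w) : \ell(w) \le k \text{ and } q(w) > 0 \}. \tag{2.9} \]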

We can now state our first theorem. 

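Theorem 1. Assume the chain $\{X_t : t \in \mathbb{Z}\}$ has summable continuity rate. Then, for any
$\epsilon \in [0,1]$ and for any $k \ge 0$,
\[ \sup_{w_{-k}^{0}} \big| \mathbb{P}(Z_0 = w_0 \mid Z_{-k}^{-1} = w_{-k}^{-1}) - \mathbb{P}(X_0 = w_0 \mid X_{-k}^{-1} = w_{-k}^{-1}) \big| \le \Big(1 + \frac{4\beta\beta^*}{\alpha}\Big)\epsilon. \]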
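
As a sanity check of the shape of this bound, consider the degenerate memoryless case, which is an assumption made here for illustration only: if the $X_t$ are i.i.d. with $\mathbb{P}(X_t = 1) = p$, then $\beta_k = 0$ for all $k$, so $\beta = 0$ and $\beta^* = 1$, and the perturbed chain is i.i.d. with $\mathbb{P}(Z_t = 1) = p(1-\epsilon) + (1-p)\epsilon$. The helper names below are hypothetical.

```python
def theorem1_bound(eps, alpha, beta, beta_star):
    """Right-hand side (1 + 4 * beta * beta_star / alpha) * eps."""
    return (1.0 + 4.0 * beta * beta_star / alpha) * eps


def iid_gap(p, eps):
    """|P(Z = 1) - P(X = 1)| when the X_t are i.i.d. Bernoulli(p)."""
    q = p * (1.0 - eps) + (1.0 - p) * eps
    return abs(q - p)


p, eps = 0.3, 0.1
print(iid_gap(p, eps), theorem1_bound(eps, min(p, 1 - p), 0.0, 1.0))
```

In this case the gap equals $\epsilon\,|1 - 2p|$, which never exceeds $\epsilon$.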



To state the second theorem we first need to present the version of the Algorithm Context 
introduced in Galves et al. (2006) and Galves and Leonardi (2007). 

In what follows we will assume that $z_1, z_2, \ldots, z_n$ is a sample of the observed chain $\{Z_t : t \in \mathbb{Z}\}$
and that the underlying chain $\{X_t : t \in \mathbb{Z}\}$ is compatible with the probabilistic context
tree $(T,p)$.

For any finite string $w$ with $\ell(w) \le n$, we denote by $N_n(w)$ the number of occurrences of
the string in the sample; that is
\[ N_n(w) = \sum_{t=0}^{n-\ell(w)} \mathbf{1}\{ z_{t+1}^{t+\ell(w)} = w \}. \tag{2.10} \]
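
The counter in (2.10) counts overlapping occurrences. A minimal sketch, assuming the sample and the string are both Python strings over the alphabet "01" (the function name `count` is a choice made for this example):

```python
def count(sample: str, w: str) -> int:
    """Number of (possibly overlapping) occurrences of w in the sample."""
    n, j = len(sample), len(w)
    return sum(sample[t:t + j] == w for t in range(n - j + 1))


print(count("01011", "01"))
print(count("0000", "00"))
```

Note that Python's built-in `str.count` would not be suitable here because it skips overlapping matches.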

For any element $a \in A$ and any finite sequence $w \in A^*$, the empirical transition probability
$\hat{q}_n(a|w)$ is defined by
\[ \hat{q}_n(a|w) = \frac{N_n(wa) + 1}{N_n(w\cdot) + |A|}, \tag{2.11} \]
where
\[ N_n(w\cdot) = \sum_{b \in A} N_n(wb). \]
This definition of $\hat{q}_n(a|w)$ is convenient because it is asymptotically equivalent to $N_n(wa)/N_n(w\cdot)$ and
it avoids an extra definition in the case $N_n(w\cdot) = 0$.
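
A direct transcription of (2.11) for the binary alphabet could look as follows. This is only a sketch: the names `ALPHABET`, `count` and `q_hat` are assumptions of this example, and samples are encoded as strings.

```python
ALPHABET = "01"


def count(sample: str, w: str) -> int:
    """Overlapping occurrences of w in the sample, as in (2.10)."""
    j = len(w)
    return sum(sample[t:t + j] == w for t in range(len(sample) - j + 1))


def q_hat(sample: str, a: str, w: str) -> float:
    """Empirical transition probability (N(wa) + 1) / (N(w.) + |A|)."""
    total = sum(count(sample, w + b) for b in ALPHABET)
    return (count(sample, w + a) + 1) / (total + len(ALPHABET))


print(q_hat("01011", "1", "0"))    # context "0" is always followed by "1" here
print(q_hat("01011", "1", "111"))  # unseen context: falls back to 1/|A|
```

The second call shows why no separate definition is needed when the context never occurs in the sample.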

The variant of Rissanen's algorithm Context we will use is defined as follows. First of all,
let us define for any finite string $w \in A^*$:
\[ \Delta_n(w) = \max_{a \in A} |\hat{q}_n(a|w) - \hat{q}_n(a|\mathrm{suf}(w))|. \]
The $\Delta_n(w)$ operator computes a distance between the empirical transition probabilities
associated to the sequence $w$ and the one associated to the sequence $\mathrm{suf}(w)$.

Definition 2.12. Given $\delta > 0$ and $d < n$, the tree estimated with the algorithm Context is
\[ \hat{T}_n^{\delta,d} = \{ w \in A_1^d : \Delta_n(w) > \delta \text{ and } \Delta_n(uw) \le \delta \ \ \forall u \in A_1^{d-\ell(w)} \}, \]
where $A_1^r$ denotes the set of all sequences of length at most $r$. In the case $\ell(w) = d$ we have
$A_1^{d-\ell(w)} = \emptyset$.

It is easy to see that $\hat{T}_n^{\delta,d}$ is a tree. Moreover, the way we defined $\hat{q}_n(\cdot|\cdot)$ in (2.11) associates
a probability distribution to each sequence in $\hat{T}_n^{\delta,d}$.
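
For small $d$, the estimator of Definition 2.12 can be evaluated by brute-force enumeration. The sketch below is an illustration under stated assumptions, not the authors' implementation: samples are strings over "01", and all function names (`count`, `q_hat`, `delta`, `strings_up_to`, `context_tree`) as well as the sample used in the call are hypothetical.

```python
from itertools import product

ALPHABET = "01"


def count(sample, w):
    """Overlapping occurrences of w in the sample, as in (2.10)."""
    j = len(w)
    return sum(sample[t:t + j] == w for t in range(len(sample) - j + 1))


def q_hat(sample, a, w):
    """Empirical transition probability (2.11)."""
    total = sum(count(sample, w + b) for b in ALPHABET)
    return (count(sample, w + a) + 1) / (total + len(ALPHABET))


def delta(sample, w):
    """Max over a of |q_hat(a|w) - q_hat(a|suf(w))|, with suf(w) = w[1:]."""
    return max(abs(q_hat(sample, a, w) - q_hat(sample, a, w[1:])) for a in ALPHABET)


def strings_up_to(r):
    """All non-empty strings of length at most r (empty list if r < 1)."""
    return ["".join(p) for j in range(1, r + 1) for p in product(ALPHABET, repeat=j)]


def context_tree(sample, d, thr):
    """Set of w with delta(w) > thr and delta(uw) <= thr for every admissible u."""
    return {
        w
        for w in strings_up_to(d)
        if delta(sample, w) > thr
        and all(delta(sample, u + w) <= thr for u in strings_up_to(d - len(w)))
    }


sample = "0110100110010110" * 4
print(sorted(context_tree(sample, 3, 0.1)))
```

The enumeration visits on the order of $|A|^d$ strings, so it is only meant for illustration with small $d$.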
We may now state our second theorem. 

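Theorem 2. Let $K$ be an integer and let $z_1, z_2, \ldots, z_n$ be a sample of the perturbed chain
$\{Z_t : t \in \mathbb{Z}\}$. Then, for any $d$ satisfying
\[ d > \max_{w \in T|_K} \min\{ \ell(v) : v \in T,\ v \succeq w \}, \tag{2.13} \]
for any $\delta$ such that $2(1 + \frac{4\beta\beta^*}{\alpha})\epsilon < \delta < D_d - 2(1 + \frac{4\beta\beta^*}{\alpha})\epsilon$ and for any $n$ such that
\[ n > \frac{4(|A| + 1)}{[\min(\delta, D_d - \delta) - 2\epsilon(1 + \frac{4\beta\beta^*}{\alpha})]\, q_d} + d \]
we have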

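\[ \mathbb{P}(\hat{T}_n^{\delta,d}|_K \ne T|_K) \le 4\, e^{\frac{1}{e}} (|A| + 1)\, |A|^{d+2} \exp\Big[ -(n-d)\, \frac{[\min(\delta, D_d - \delta) - 2\epsilon(1 + \frac{4\beta\beta^*}{\alpha})]^2\, q_d^2}{256\, e\, (1 + \beta)\, |A|^2 (d+1)} \Big]. \]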

As a consequence we obtain the following strong consistency result.

Corollary 1. For any integer $K$ and for almost all infinite samples $z_1, z_2, \ldots$ there exists an $\bar{n}$
such that, for any $n \ge \bar{n}$ we have
\[ \hat{T}_n^{\delta,d}|_K = T|_K, \tag{2.14} \]
where $d$ is given by (2.13) and $\delta$ is such that $2(1 + \frac{4\beta\beta^*}{\alpha})\epsilon < \delta < D_d - 2(1 + \frac{4\beta\beta^*}{\alpha})\epsilon$.

3. Proof of Theorem 1 
We start by proving three preparatory lemmas. 

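Lemma 3.1. For any $k > j \ge 0$ and any $\epsilon \in [0,1]$ we have
\[ \sup_{w_{-\infty}^{0},\, a,\, b} \big| \mathbb{P}(X_0 = w_0 \mid X_{-j}^{-1} = w_{-j}^{-1}, X_{-j-1} = a, Z_{-j-1} = b, Z_{-k}^{-j-2} = w_{-k}^{-j-2}) - p(w_0 | w_{-\infty}^{-1}) \big| \le \beta_j. \]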

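Proof. We observe that for $j \ge 0$ it follows from the independence of the flipping procedure
that
\[ \mathbb{P}(X_0 = w_0 \mid X_{-j}^{-1} = w_{-j}^{-1}, X_{-j-1} = a, Z_{-j-1} = b, Z_{-k}^{-j-2} = w_{-k}^{-j-2}) \]
\[ = \frac{\sum_{u_{-k}^{-j-2}} \mathbb{P}(X_{-k}^{0} = u_{-k}^{-j-2}\, a\, w_{-j}^{0})\, \mathbb{P}(Z_{-k}^{-j-1} = w_{-k}^{-j-2}\, b \mid X_{-k}^{-j-1} = u_{-k}^{-j-2}\, a)}{\sum_{u_{-k}^{-j-2}} \mathbb{P}(X_{-k}^{-1} = u_{-k}^{-j-2}\, a\, w_{-j}^{-1})\, \mathbb{P}(Z_{-k}^{-j-1} = w_{-k}^{-j-2}\, b \mid X_{-k}^{-j-1} = u_{-k}^{-j-2}\, a)}. \]
It is easy to see using conditioning on the infinite past that
\[ \inf_{v_{-\infty}^{-j-1}} \mathbb{P}(X_0 = w_0 \mid X_{-j}^{-1} = w_{-j}^{-1}, X_{-\infty}^{-j-1} = v_{-\infty}^{-j-1}) \]
\[ \le \mathbb{P}(X_0 = w_0 \mid X_{-k}^{-1} = u_{-k}^{-j-2}\, a\, w_{-j}^{-1}) \]
\[ \le \sup_{v_{-\infty}^{-j-1}} \mathbb{P}(X_0 = w_0 \mid X_{-j}^{-1} = w_{-j}^{-1}, X_{-\infty}^{-j-1} = v_{-\infty}^{-j-1}). \]
Then, using continuity we have
\[ p(w_0 | w_{-\infty}^{-1}) - \beta_j \le \mathbb{P}(X_0 = w_0 \mid X_{-k}^{-1} = u_{-k}^{-j-2}\, a\, w_{-j}^{-1}) \le p(w_0 | w_{-\infty}^{-1}) + \beta_j \]
and the assertion of the Lemma follows immediately. □
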
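Lemma 3.2. For any $\epsilon \in [0,1]$ and for any $k \ge 0$ we have
\[ \inf_{w_{-k}^{0}} \mathbb{P}(Z_0 = w_0 \mid Z_{-k}^{-1} = w_{-k}^{-1}) \ge \alpha, \]
and
\[ \inf_{w_{-k}^{0}} \mathbb{P}(X_0 = w_0 \mid Z_{-k}^{-1} = w_{-k}^{-1}) \ge \alpha. \]
Moreover, for any $0 \le j < k$ we have
\[ \inf_{w_{-k}^{-1}} \mathbb{P}(X_{-j-1} = w_{-j-1} \mid X_{-j}^{-1} = w_{-j}^{-1}, Z_{-k}^{-j-2} = w_{-k}^{-j-2}) \ge \frac{\alpha}{\beta^*}. \]

Proof. We first observe that
\[ \mathbb{P}(Z_0 = w_0 \mid Z_{-k}^{-1} = w_{-k}^{-1}) = (1 - \epsilon)\, \mathbb{P}(X_0 = w_0 \mid Z_{-k}^{-1} = w_{-k}^{-1}) + \epsilon\, \mathbb{P}(X_0 = \bar{w}_0 \mid Z_{-k}^{-1} = w_{-k}^{-1}). \]
It is therefore enough to prove the second assertion. From the independence of the flipping
procedure we have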

P(X = ™o | ZZl = wZl) = 

(1 " e) k E„-ipH I uZjvuZ^MXz! = uZ] | XZ 1 ^ 1 = wZ 1 ' 1 )^ - e )jZJ=-u<»*»» 

lim — 

(1 - ef E u -_l HXZI = uZ] | XZ 1 ^ 1 = wZ 1 ' 1 ) (6/(1 - 

> a. 

For the last assertion we first observe that 

nzzr = wzr i x:r 2 = ggg^ = ^j-^cr = 

£, =r p(*X 2 = -X 2 1 xzr 2 = x:r 2 )F(xr) = w Z ],xzi 2 - ^~ 2 

Moreover, 

p(^-i = ^}_i^T 2 = xT 2 ) _ 



X 



(xz] = wz),x 



-fc 



X 



ng p(x_, = ^ xzjij = xzi 2 = xzi 2 ) ut J+ 2 = ^x-t 1 = x-k 1 ) 
nf =1 p(x_, = xij- 1 = «c}-\ xx 2 = x zr 2 ) n? =i+2 p(x-, = jrj- 1 = zX 1 ) 

> F(X- j - 1 = w- j -i\X_ 3 k 2 =x_{ 2 ff^ — riz 

f} 1 W{X_ l = w- l \XZ l - 1 = wZ l -\X_l 2 = xJ k 2 ) 

and using non-nullness and log-continuity this is bounded below by 



J 1 

nl a 
This finishes the proof of the Lemma. □

Lemma 3.3. For any $k > j \ge 0$ and any $\epsilon \in [0,1]$
\[ \sup_{w_{-k}^{0}} \mathbb{P}(X_{-j-1} = \bar{w}_{-j-1} \mid X_{-j}^{-1} = w_{-j}^{-1}, Z_{-k}^{-j-1} = w_{-k}^{-j-1}) \le \frac{\beta^*}{\alpha}\, \epsilon. \]

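Proof. We have
\[ \mathbb{P}(X_{-j-1} = \bar{w}_{-j-1} \mid X_{-j}^{-1} = w_{-j}^{-1}, Z_{-k}^{-j-1} = w_{-k}^{-j-1}) \]
\[ = \frac{\mathbb{P}(X_{-j-1} = \bar{w}_{-j-1}, Z_{-j-1} = w_{-j-1} \mid X_{-j}^{-1} = w_{-j}^{-1}, Z_{-k}^{-j-2} = w_{-k}^{-j-2})}{\mathbb{P}(Z_{-j-1} = w_{-j-1} \mid X_{-j}^{-1} = w_{-j}^{-1}, Z_{-k}^{-j-2} = w_{-k}^{-j-2})} \]
\[ \le \frac{\epsilon}{\mathbb{P}(Z_{-j-1} = w_{-j-1} \mid X_{-j}^{-1} = w_{-j}^{-1}, Z_{-k}^{-j-2} = w_{-k}^{-j-2})}. \]
It follows from Lemma 3.2 that
\[ \mathbb{P}(Z_{-j-1} = w_{-j-1} \mid X_{-j}^{-1} = w_{-j}^{-1}, Z_{-k}^{-j-2} = w_{-k}^{-j-2}) \]
\[ = (1 - \epsilon)\, \mathbb{P}(X_{-j-1} = w_{-j-1} \mid X_{-j}^{-1} = w_{-j}^{-1}, Z_{-k}^{-j-2} = w_{-k}^{-j-2}) \]
\[ \quad + \epsilon\, \mathbb{P}(X_{-j-1} = \bar{w}_{-j-1} \mid X_{-j}^{-1} = w_{-j}^{-1}, Z_{-k}^{-j-2} = w_{-k}^{-j-2}) \]
\[ \ge \frac{\alpha}{\beta^*}. \]
This concludes the proof of Lemma 3.3. □
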
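Proof of Theorem 1. We first observe that
\[ \mathbb{P}(Z_0 = w_0 \mid Z_{-k}^{-1} = w_{-k}^{-1}) = (1 - \epsilon)\, \mathbb{P}(X_0 = w_0 \mid Z_{-k}^{-1} = w_{-k}^{-1}) + \epsilon\, \mathbb{P}(X_0 = \bar{w}_0 \mid Z_{-k}^{-1} = w_{-k}^{-1}). \]
Therefore,
\[ \big| \mathbb{P}(Z_0 = w_0 \mid Z_{-k}^{-1} = w_{-k}^{-1}) - \mathbb{P}(X_0 = w_0 \mid Z_{-k}^{-1} = w_{-k}^{-1}) \big| \le \epsilon \]
and if $k = 0$ the Theorem is proved. We will now assume $k \ge 1$ and we write
\[ \mathbb{P}(X_0 = w_0 \mid Z_{-k}^{-1} = w_{-k}^{-1}) - \mathbb{P}(X_0 = w_0 \mid X_{-k}^{-1} = w_{-k}^{-1}) \]
\[ = \sum_{j=0}^{k-1} \big[ \mathbb{P}(X_0 = w_0 \mid X_{-j}^{-1} = w_{-j}^{-1}, Z_{-k}^{-j-1} = w_{-k}^{-j-1}) - \mathbb{P}(X_0 = w_0 \mid X_{-j-1}^{-1} = w_{-j-1}^{-1}, Z_{-k}^{-j-2} = w_{-k}^{-j-2}) \big]. \]
We will bound each term in the sum separately. We can write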

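\[ \mathbb{P}(X_0 = w_0 \mid X_{-j}^{-1} = w_{-j}^{-1}, Z_{-k}^{-j-1} = w_{-k}^{-j-1}) - \mathbb{P}(X_0 = w_0 \mid X_{-j-1}^{-1} = w_{-j-1}^{-1}, Z_{-k}^{-j-2} = w_{-k}^{-j-2}) \]
\[ = \sum_{b \in \{0,1\}} \big[ \mathbb{P}(X_0 = w_0 \mid X_{-j}^{-1} = w_{-j}^{-1}, X_{-j-1} = b, Z_{-k}^{-j-1} = w_{-k}^{-j-1}) - \mathbb{P}(X_0 = w_0 \mid X_{-j-1}^{-1} = w_{-j-1}^{-1}, Z_{-k}^{-j-2} = w_{-k}^{-j-2}) \big] \]
\[ \qquad \times\ \mathbb{P}(X_{-j-1} = b \mid X_{-j}^{-1} = w_{-j}^{-1}, Z_{-k}^{-j-1} = w_{-k}^{-j-1}). \]
The above sum is a sum of two terms, one with $b = \bar{w}_{-j-1}$, the other one with $b = w_{-j-1}$.
We will bound above these two terms separately. For the first term we have the bound
\[ \big| \mathbb{P}(X_0 = w_0 \mid X_{-j}^{-1} = w_{-j}^{-1}, X_{-j-1} = \bar{w}_{-j-1}, Z_{-k}^{-j-1} = w_{-k}^{-j-1}) - \mathbb{P}(X_0 = w_0 \mid X_{-j-1}^{-1} = w_{-j-1}^{-1}, Z_{-k}^{-j-2} = w_{-k}^{-j-2}) \big| \]
\[ \qquad \times\ \mathbb{P}(X_{-j-1} = \bar{w}_{-j-1} \mid X_{-j}^{-1} = w_{-j}^{-1}, Z_{-k}^{-j-1} = w_{-k}^{-j-1}) \le \frac{2\beta_j \beta^*}{\alpha}\, \epsilon \]
from Lemma 3.1 and Lemma 3.3. For the other term we can write
\[ \big| \mathbb{P}(X_0 = w_0 \mid X_{-j}^{-1} = w_{-j}^{-1}, X_{-j-1} = w_{-j-1}, Z_{-k}^{-j-1} = w_{-k}^{-j-1}) - \mathbb{P}(X_0 = w_0 \mid X_{-j-1}^{-1} = w_{-j-1}^{-1}, Z_{-k}^{-j-2} = w_{-k}^{-j-2}) \big| \]
\[ \qquad \times\ \mathbb{P}(X_{-j-1} = w_{-j-1} \mid X_{-j}^{-1} = w_{-j}^{-1}, Z_{-k}^{-j-1} = w_{-k}^{-j-1}) \]
\[ \le \sum_{a \in \{0,1\}} \big| \mathbb{P}(X_0 = w_0 \mid X_{-j}^{-1} = w_{-j}^{-1}, X_{-j-1} = w_{-j-1}, Z_{-k}^{-j-1} = w_{-k}^{-j-1}) \]
\[ \qquad\qquad - \mathbb{P}(X_0 = w_0 \mid X_{-j-1}^{-1} = w_{-j-1}^{-1}, Z_{-j-1} = a, Z_{-k}^{-j-2} = w_{-k}^{-j-2}) \big| \]
\[ \qquad \times\ \mathbb{P}(Z_{-j-1} = a \mid X_{-j-1}^{-1} = w_{-j-1}^{-1}, Z_{-k}^{-j-2} = w_{-k}^{-j-2}) \]
\[ \qquad \times\ \mathbb{P}(X_{-j-1} = w_{-j-1} \mid X_{-j}^{-1} = w_{-j}^{-1}, Z_{-k}^{-j-1} = w_{-k}^{-j-1}). \]
Using the fact that the term in the sum with $a = w_{-j-1}$ vanishes this is bounded above by
\[ \big| \mathbb{P}(X_0 = w_0 \mid X_{-j}^{-1} = w_{-j}^{-1}, X_{-j-1} = w_{-j-1}, Z_{-k}^{-j-1} = w_{-k}^{-j-1}) \]
\[ \qquad - \mathbb{P}(X_0 = w_0 \mid X_{-j-1}^{-1} = w_{-j-1}^{-1}, Z_{-j-1} = \bar{w}_{-j-1}, Z_{-k}^{-j-2} = w_{-k}^{-j-2}) \big| \]
\[ \qquad \times\ \mathbb{P}(Z_{-j-1} = \bar{w}_{-j-1} \mid X_{-j-1}^{-1} = w_{-j-1}^{-1}, Z_{-k}^{-j-2} = w_{-k}^{-j-2}) \le 2\beta_j\, \epsilon \]
from Lemma 3.1. Putting all the above bounds together we get
\[ \big| \mathbb{P}(Z_0 = w_0 \mid Z_{-k}^{-1} = w_{-k}^{-1}) - \mathbb{P}(X_0 = w_0 \mid X_{-k}^{-1} = w_{-k}^{-1}) \big| \le \epsilon + \frac{2\beta\beta^*}{\alpha}\, \epsilon + 2\beta\epsilon \]
and the Theorem follows. □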

4. Proof of Theorem 2 
We start by proving four new auxiliary Lemmas. 

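Lemma 4.1. For any $i \ge 1$, any $k > i$, any $j \ge 1$ and any finite sequence $w_1^j$, the following
inequality holds
\[ \sup_{x_1^i,\, \theta_1^i \in A^i} \big| \mathbb{P}(Z_k^{k+j-1} = w_1^j \mid X_1^i = x_1^i, \xi_1^i = \theta_1^i) - q(w_1^j) \big| \le j\, \beta_{k-i-1}. \]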

Proof. Observe that for any x\, 6\ <G A 1 

\nzt +] - 1 = w{ | X\ = x{, Q = 6\) - q(w{)\ 

= I E = Z^' 1 = w{ I X\ = xU\ = 9\) - q{w{)\ 

x k +j -i eAj 

E nzi^- 1 = w\ i xt*- 1 = x^-'nx^- 1 = x^"- 1 1 = *ui = e\) 



x k+j-i eAJ 



-q(w{)\ 

by the independence of the flipping procedure. The last term can be bounded above by 

- e nz^- 1 = 4 1 x™- 1 = i = 1 *i = *i) 



k+j — 1 k+j— 1\ 



Then, we can use Lemma 3.6 in Galves and Leonardi (2007) to bound above the last sum 
with 



E jPk-i-iHx^-^x^- 1 ) 



x k+j-l eAj 

We conclude the proof of Lemma 4.1. □

Lemma 4.2. For any finite sequence $w$ and any $t > 0$ we have
\[ \mathbb{P}(|N_n(w) - (n - \ell(w) + 1)\, q(w)| > t) \le e^{\frac{1}{e}} \exp\Big[ \frac{-t^2}{4e(1 + \beta)\, \ell(w)\, (n - \ell(w) + 1)} \Big]. \]



Moreover, for any a G A and any n > ^|^y + we have 



\q n (a\w)-q(a\w)\ > t) < 



n A\ , in i r / ^ \ , 1N ^ (^(LjflMtu)] 2 ^^) + n^'(J>fJ 2 i 

(|A| + l)c exp[-(n-<(«,) + l) 16e(1 + /3) | A | 2(£W + 1) ] • 

Proof. Observe that for any finite sequence w{ G A 7 

^K) = E II [ 1 {jft+i=t«i} 1 ««+i=o} + 1 {x t+i =«j i } 1 te+i=i}]- 

t=0 i=l 

Define the process {L^ : t G Z} by 

j 

^ = II [ 1 {- y t+i-i=toi} 1 tt«+i-i=o} + ^Xt+i-i^i^tft+i-^i}] 
i=i 

and denote by $\mathcal{M}_i$ the $\sigma$-algebra generated by $U_1, \ldots, U_i$. Applying Proposition 4 in Dedecker
and Doukhan (2003) we obtain that, for any $r \ge 2$



\\N n (w{) - (n - j + l)q(w{)\\ r 

n-j+l t 

\ 2r E Jf^E^i^iii) 



< 



< 



< 



< 



< 



1=1 

n-j+l 



k=i 



n— j+1 



m(Uk\M t )\\oo) 



2r £ 

i=l k=i 
n-j+l n-j+l i 

2r E E su p iE(w = *i)i) 3 
i=i k=i °\^ a% 

n—j+l n— j+1 

2r E E su p iE(c/ fe |x{ = xi,ei = ^i)i 

i=l k =i x\,9\eAi 
n—j+l n—j+l 

2r E E su p iP(^" 1 = ^ii^ = ^ei = ^i)-gK)i) 

i=l k=i x\,6\eA' 



Using Lemma 4.1 we can bound above the last expression by
\[ [2r(1 + \beta)\, \ell(w)\, (n - j + 1)]^{\frac{1}{2}}. \]
Then, as in Galves and Leonardi (2007) we obtain
\[ \mathbb{P}(|N_n(w) - (n - \ell(w) + 1)\, q(w)| > t) \le e^{\frac{1}{e}} \exp\Big[ \frac{-t^2}{4e(1 + \beta)\, \ell(w)\, (n - \ell(w) + 1)} \Big] \]
and 

P(|g n (a|«;)-g(o|«;)| > t) < 

n t\ i i\ — r t 01 \ , a\ ^ ~ (n-e(wyti)q(w)} 2 \-l( w } + nJ(l}a^i 
(\A\ + l)ee exp[-(n-<(«,) + l) i^ppT^ ] ■ 

This concludes the proof of Lemma 4.2 
Lemma 4.3. For any S > 2(1 + ^^-)e, for any 

n> , 2(|A| +1> +rf 



and /or any w £ T, uw £ 7jf' d we /icwe f/iaf 



P(A n (nw) > (5) < 2|A|(|A| + l)eeexp[-(n-d) 



32e(l+/3)|,4| 2 (d+l) J ' 
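Proof. Recall that
\[ \Delta_n(uw) = \max_{a \in A} |\hat{q}_n(a|uw) - \hat{q}_n(a|\mathrm{suf}(uw))|. \]
Note that the fact $w \in T$ implies that for any finite sequence $u$ and any symbol $a \in A$ we
have $p(a|uw) = p(a|\mathrm{suf}(uw))$. Hence,
\[ |\hat{q}_n(a|uw) - \hat{q}_n(a|\mathrm{suf}(uw))| \le |\hat{q}_n(a|uw) - q(a|uw)| + |q(a|uw) - p(a|uw)| \]
\[ \qquad + |q(a|\mathrm{suf}(uw)) - p(a|\mathrm{suf}(uw))| + |\hat{q}_n(a|\mathrm{suf}(uw)) - q(a|\mathrm{suf}(uw))|. \]
Then, using Theorem 1 we have that
\[ \mathbb{P}(\Delta_n(uw) > \delta) \le \sum_{a \in A} \Big[ \mathbb{P}\Big(|\hat{q}_n(a|uw) - q(a|uw)| > \frac{\delta}{2} - \epsilon\Big(1 + \frac{4\beta\beta^*}{\alpha}\Big)\Big) \]
\[ \qquad + \mathbb{P}\Big(|\hat{q}_n(a|\mathrm{suf}(uw)) - q(a|\mathrm{suf}(uw))| > \frac{\delta}{2} - \epsilon\Big(1 + \frac{4\beta\beta^*}{\alpha}\Big)\Big) \Big]. \]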

Now, for
we can bound above the right hand side of the expression above using Lemma 4.2 by 



for any 



Lemma 4.4. For any d satisfying (2.13), for any 5 < D d — 2e(l + ^§-), fo 

„> — m±v — +d 

and for any w € 7~n' d with i{w) < K we have that 

\Dru lAniuw) £ s}) i2m + 1>e ' exp[ ~ < "~ <!) |C ^('V"S"+T)' 

Proof. As d satisfies (2.13) we have that there exists a uw G T\d such that ira; G T. Then 

P( P| {A n (« W ) < 5}) < ¥(A n (uw) < 5). 

uweT\ d 

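Observe that for any $a \in A$,
\[ |\hat{q}_n(a|\mathrm{suf}(uw)) - \hat{q}_n(a|uw)| \ge |p(a|\mathrm{suf}(uw)) - p(a|uw)| - |\hat{q}_n(a|\mathrm{suf}(uw)) - q(a|\mathrm{suf}(uw))| \]
\[ \qquad - |\hat{q}_n(a|uw) - q(a|uw)| - |q(a|\mathrm{suf}(uw)) - p(a|\mathrm{suf}(uw))| - |q(a|uw) - p(a|uw)|. \]
Hence, we have that for any $a \in A$
\[ \Delta_n(uw) \ge D_d - 2\epsilon\Big(1 + \frac{4\beta\beta^*}{\alpha}\Big) - |\hat{q}_n(a|\mathrm{suf}(uw)) - q(a|\mathrm{suf}(uw))| - |\hat{q}_n(a|uw) - q(a|uw)|. \]
Therefore,
\[ \mathbb{P}(\Delta_n(uw) \le \delta) \le \mathbb{P}\Big( \bigcap_{a \in A} \Big\{ |\hat{q}_n(a|\mathrm{suf}(uw)) - q(a|\mathrm{suf}(uw))| \ge \frac{D_d - 2\epsilon(1 + \frac{4\beta\beta^*}{\alpha}) - \delta}{2} \Big\} \Big) \]
\[ \qquad + \mathbb{P}\Big( \bigcap_{a \in A} \Big\{ |\hat{q}_n(a|uw) - q(a|uw)| \ge \frac{D_d - 2\epsilon(1 + \frac{4\beta\beta^*}{\alpha}) - \delta}{2} \Big\} \Big). \]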

As 5 < D d - 2e(l + and 

„ > 4 ^l + 1 ) + d 

(D d -2e(l + ^)-% d 
we can use Lemma 4.2 to bound above the right hand side of the inequality above by 

This concludes the proof of Lemma 4.4 □ 



Now we proceed with the proof of our main result. 



Proof of Theorem 2. Define 



nf = U U {*»(««/)>«}, 

weT , 

[w)<K 



and 



U n {^n(uw)<6}. 

t(w)<K 



Then, if d < n we have that 



{T*?\ K + T\k} = <%f U U n 



s ■ 



Using the definition of O^'f and U^f we have that 

° n,o n,o 



F(ty\ K + t\ K ) < e e n A «M > + e p ( n a «^) ^ 



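Applying Lemma 4.3 and Lemma 4.4 we obtain, for
\[ n > \frac{4(|A| + 1)}{[\min(\delta, D_d - \delta) - 2\epsilon(1 + \frac{4\beta\beta^*}{\alpha})]\, q_d} + d, \]
the inequality
\[ \mathbb{P}(\hat{T}_n^{\delta,d}|_K \ne T|_K) \le 4\, e^{\frac{1}{e}} (|A| + 1)\, |A|^{d+2} \exp\Big[ -(n-d)\, \frac{[\min(\delta, D_d - \delta) - 2\epsilon(1 + \frac{4\beta\beta^*}{\alpha})]^2\, q_d^2}{256\, e\, (1 + \beta)\, |A|^2 (d+1)} \Big]. \]
We conclude the proof of Theorem 2. □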



Proof of Corollary 1. It follows from Theorem 2, using the first Borel-Cantelli Lemma and 
the fact that the bounds for the error estimation of the context tree are summable in n for a 
fixed d satisfying (2.13). □ 